12 research outputs found

    Cost-effective data structural preparation

    People structure and represent their data in many different ways. One factor to consider in choosing between different representations is how the structure will affect the effectiveness of algorithms that run over the data. In fact, before sophisticated analytics can be performed, one must usually go through a data preparation phase, where the structural representation of the data is changed to be more suitable for the particular analytics procedure that will be performed. This is necessary because individual analytics algorithms are effective only for certain kinds of structural representations of their input data. Unfortunately, analytics algorithms do not come with a clear description of their desired representation. Hence, time and expertise are required to identify and materialize a suitable representation for each analytics task. In this dissertation, we address this issue in data preparation. Our first contribution focuses on the concept of design independence, in which the intent is to create an analytics algorithm that is effective regardless of the choice of data representation. The benefit of becoming more design independent is that it will reduce or, in the most favorable outcome, remove the cost of manually finding and preparing the most effective structure or schema for the data. In this part of our work, we consider common variations of data source structure that preserve its content. For the analytics task of similarity search, we propose an algorithm that satisfies the design independence property against the studied variations. We then generalize our findings to other structural variations and prove that the algorithm is design independent with respect to them. We show that humans find its answers at least as desirable as those provided by existing similarity search algorithms.
In the case where design independence is not achievable, we address the data preparation issue by proposing an algorithm that finds a cost-effective structure to be imposed on an unstructured dataset. Under this approach, structural information is added to the data source to improve the effectiveness of an algorithm running over the data. We leverage the information from an existing domain of concepts or an ontology to add structure to the data collection in the form of annotations. Because each concept may require different amounts of resources and time in annotating and/or maintaining the data source, we would like to find a set of affordable concepts that improves the effectiveness of an algorithm the most. This is called the cost-effective conceptual design problem. Previous work on this topic assumed that a domain of concepts is simply an unorganized set of concepts. However, real-world domains are often organized, in the form of taxonomies for example. Hence, in this dissertation, we explore a new version of the cost-effective conceptual design problem, using taxonomies of concepts and considering multi-concept queries.

    Representation Independent Analytics Over Structured Data

    Database analytics algorithms leverage quantifiable structural properties of the data to predict interesting concepts and relationships. The same information, however, can be represented using many different structures, and the structural properties observed over particular representations do not necessarily hold for alternative structures. Thus, there is no guarantee that current database analytics algorithms will still provide the correct insights, no matter what structures are chosen to organize the database. Because these algorithms tend to be highly effective over some choices of structure, such as that of the databases used to validate them, but not so effective with others, database analytics has largely remained the province of experts who can find the desired forms for these algorithms. We argue that in order to make database analytics usable, we should use or develop algorithms that are effective over a wide range of choices of structural organizations. We introduce the notion of representation independence, study its fundamental properties for a wide range of data analytics algorithms, and empirically analyze the amount of representation independence of some popular database analytics algorithms. Our results indicate that most algorithms are not generally representation independent, and they identify the characteristics of heuristics that are more representation independent under certain representational shifts.

    How schema independent are schema free query interfaces?

    Real-world databases often have extremely complex schemas. With thousands of entity types and relationships, each with a hundred or so attributes, it is extremely difficult for new users to explore the data and formulate queries. Schema free query interfaces (SFQIs) address this problem by allowing users with no knowledge of the schema to submit queries. We postulate that SFQIs should deliver the same answers when given alternative but equivalent schemas for the same underlying information. In this paper, we introduce and formally define design independence, which captures this property for SFQIs. We establish a theoretical framework to measure the amount of design independence provided by an SFQI. We show that most current SFQIs provide a very limited degree of design independence. We also show that SFQIs based on the statistical properties of data can provide design independence when the changes in the schema do not introduce or remove redundancy in the data. We propose a novel XML SFQI called Duplication Aware Coherency Ranking (DA-CR) based on information-theoretic relationships among the data items in the database, and prove that DA-CR is design independent. Our extensive empirical study using three real-world data sets shows that the average case design independence of current SFQIs is considerably lower than that of DA-CR. We also show that the ranking quality of DA-CR is better than or equal to that of current SFQI methods.
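
    The abstract above does not say how design independence is measured. As a rough illustration only, the sketch below assumes one plausible measure: the normalized Kendall tau distance between the answer rankings that the same interface returns over two equivalent schemas. The function name and the choice of metric are assumptions made here, not taken from the paper.

```python
from itertools import combinations


def kendall_tau_distance(ranking_a, ranking_b):
    """Fraction of item pairs ordered differently by the two rankings.

    Hypothetical helper: 0.0 means identical order, 1.0 means reversed order.
    Both rankings must contain the same distinct items.
    """
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    if pos_a.keys() != pos_b.keys():
        raise ValueError("rankings must contain the same items")

    pairs = list(combinations(ranking_a, 2))
    if not pairs:
        return 0.0

    # A pair is discordant when the two rankings disagree on its relative order.
    discordant = sum(
        1
        for x, y in pairs
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )
    return discordant / len(pairs)
```

    Under this assumed measure, an interface that is fully design independent for a given query would score 0.0 for every pair of equivalent schemas, and larger values indicate answers that shift with the schema.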

    Cost-effective conceptual design using taxonomies

    It is known that annotating entities in unstructured and semi-structured datasets by their concepts improves the effectiveness of answering queries over these datasets. Ideally, one would like to annotate entities of all relevant concepts in a dataset. However, it takes substantial time and computational resources to annotate concepts in large datasets, and an organization may have sufficient resources to annotate only a subset of relevant concepts. Clearly, the organization would like to annotate the subset of concepts that provides the most effective answers to queries over the dataset. We propose a formal framework that quantifies the amount by which annotating entities of concepts from a taxonomy in a dataset improves the effectiveness of answering queries over the dataset. Because the problem is NP-hard, we propose efficient approximation and pseudo-polynomial time algorithms for several cases of the problem. Our extensive empirical studies validate our framework and show the accuracy and efficiency of our algorithms.
    National Science Foundation (Grants IIS-1421247, CCF-0938071, CCF-0938064 and CNS-0716532)
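
    The abstract mentions pseudo-polynomial time algorithms without describing them. The sketch below is not the authors' algorithm. It shows the general flavor under strong simplifying assumptions: the taxonomy is ignored, each concept has an integer annotation cost, and each concept contributes an independent benefit, so the selection reduces to a 0/1 knapsack. The function name, the concept names, and all numbers are hypothetical.

```python
def select_concepts(concepts, budget):
    """Pick concepts that maximize total benefit within an integer budget.

    concepts: list of (name, cost, benefit) tuples with integer cost >= 0.
    Returns (total_benefit, chosen_names). Runs in O(len(concepts) * budget)
    time, which is pseudo-polynomial because it depends on the budget value.
    """
    # best[b] holds the best (benefit, names) achievable with total cost <= b.
    best = [(0.0, [])] * (budget + 1)
    for name, cost, benefit in concepts:
        # Iterate budgets downward so each concept is selected at most once.
        for b in range(budget, cost - 1, -1):
            candidate = best[b - cost][0] + benefit
            if candidate > best[b][0]:
                best[b] = (candidate, best[b - cost][1] + [name])
    return best[budget]


# Hypothetical costs and benefits, for illustration only.
example = [("person", 3, 4.0), ("organization", 4, 5.0), ("location", 2, 3.0)]
total, chosen = select_concepts(example, budget=5)
```

    With a budget of 5, the sketch picks "person" and "location" (cost 5, benefit 7.0) over "organization" alone (cost 4, benefit 5.0). A taxonomy-aware version would need to model how annotating a concept interacts with its ancestors and descendants, which this simplification leaves out.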